Fix Database Installation Path and Prevent Duplicate Database#36
Conversation
|
💚 CLA has been signed |
|
@chengyongru Can you sign the CLA (see bot comment)? The PR cannot be merged until then |
There was a problem hiding this comment.
Pull request overview
This PR fixes how the Detect-It-Easy (DIE) signature database is packaged/installed for the Python bindings, addressing an incorrect nested install path and a Windows-only duplicate database install, while adding backward-compatible path handling and tests.
Changes:
- Correct CMake install destinations so the database lands in
die/db/(notdie/db/db/) and add Windows-specific cleanup + ICU DLL install patterns. - Add a backward-compatibility shim for
die.database_pathso both new (database_path) and old (database_path / "db") usage work. - Update tests and README examples to use the corrected database path.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| python/CMakeLists.txt | Fixes database install destination; adds Windows workaround to remove wrongly installed top-level db; extends Windows DLL install patterns. |
| python/die/init.py | Introduces _DatabasePath compatibility layer and updates database_path behavior/documentation. |
| python/src/die.cpp | Bumps native module __version__. |
| python/tests/test_die.py | Updates assertions for new database path behavior and adds new database path tests. |
| python/tests/test_regression.py | Updates regression test to use the corrected database path and renames the test to match issue 28. |
| README.md | Updates usage examples and paths to reflect the corrected database location. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # validate die database | ||
| assert isinstance(die.database_path, pathlib.Path) | ||
| assert isinstance(die.database_path, die._DatabasePath) | ||
| assert die.database_path.exists() |
There was a problem hiding this comment.
This assertion now couples the public API to the private _DatabasePath implementation. If _DatabasePath is meant to be an internal compatibility shim, tests should validate the public contract instead (e.g., os.fspath(die.database_path) returns an existing directory, and it behaves like a PathLike) rather than asserting the internal type.
| assert (db_path / dir_name).exists() or (db_path / dir_name).exists(), \ | ||
| f"Expected directory {dir_name} not found at {db_path}" |
There was a problem hiding this comment.
The assertion uses the same expression on both sides of the or ((db_path / dir_name).exists() or (db_path / dir_name).exists()), which makes the or redundant and suggests the test isn't checking what was intended. If the goal is to allow both layouts, consider checking the alternate location explicitly (e.g., nested db/), or just assert on the resolved path only.
| assert (db_path / dir_name).exists() or (db_path / dir_name).exists(), \ | |
| f"Expected directory {dir_name} not found at {db_path}" | |
| # Allow both new layout (db_path/dir_name) and legacy layout (db_path/db/dir_name) | |
| assert (db_path / dir_name).exists() or (db_path / 'db' / dir_name).exists(), \ | |
| f"Expected directory {dir_name} not found at {db_path} (or legacy db/ layout)" |
| DeprecationWarning, | ||
| stacklevel=2 | ||
| ) | ||
| return base_path |
There was a problem hiding this comment.
__truediv__ returns a pathlib.Path instance here (return base_path), while other divisions return a _DatabasePath/PurePath. This inconsistent return type can surprise callers and makes path chaining behave differently depending on the segment. Prefer returning the same path type consistently (e.g., return a _DatabasePath pointing at the resolved base path) while still emitting the deprecation warning.
| return base_path | |
| # Return a _DatabasePath (or subclass) pointing at the base path | |
| return self.__class__(base_path_str) |
| res = die.scan_file( | ||
| target_binary, | ||
| die.ScanFlags.DEEP_SCAN, | ||
| database=str(die.database_path) | ||
| ) |
There was a problem hiding this comment.
These multi-line scan_file calls are not formatted consistently with the rest of the file (missing trailing comma on the last argument, so Black would reformat it). Please add trailing commas so formatting is stable and consistent with existing calls above.
| install(CODE [[ | ||
| set(WRONG_DB_DIR "${CMAKE_INSTALL_PREFIX}/db") | ||
| if(EXISTS "${WRONG_DB_DIR}") | ||
| execute_process(COMMAND ${CMAKE_COMMAND} -E remove_directory "${WRONG_DB_DIR}") | ||
| message(STATUS "Removed incorrectly installed database: ${WRONG_DB_DIR}") |
There was a problem hiding this comment.
This install-time removal deletes ${CMAKE_INSTALL_PREFIX}/db if it exists. Even if this is intended for wheel staging, it is very broad and could remove an unrelated db directory when installing to a non-staging prefix. Consider narrowing the deletion by verifying the directory contains DIE signature markers (e.g., PE/) and/or removing only the exact path that die_library installs into, to reduce the risk of unintended data loss.
| database_path = pathlib.Path(__path__[0]) / "db" | ||
| """Path to the DIE signature database""" | ||
|
|
||
| class _DatabasePath(pathlib.PurePosixPath): |
There was a problem hiding this comment.
database_path used to be a pathlib.Path, but _DatabasePath currently subclasses pathlib.PurePosixPath, which is not a concrete filesystem path type and breaks isinstance(die.database_path, pathlib.Path) for existing callers. Consider instead subclassing the platform-specific concrete path type (e.g., type(pathlib.Path())) or making database_path a normal pathlib.Path and adding a separate helper/wrapper for the backward-compat resolution, so the public API type remains stable.
| class _DatabasePath(pathlib.PurePosixPath): | |
| _BasePath = type(pathlib.Path()) | |
| class _DatabasePath(_BasePath): |
| This path automatically points to the correct database location: | ||
| - In new versions (0.6.0+): directly at die/db/ | ||
| - In old versions (0.5.x): at die/db/db/ | ||
|
|
||
| Usage: | ||
| # New code (recommended): | ||
| die.scan_file(file, flags, str(die.database_path)) | ||
|
|
||
| # Old code (still works, but shows deprecation warning): |
There was a problem hiding this comment.
The docstring claims the fixed layout applies to "new versions (0.6.0+)", but this PR bumps the extension module to 0.5.1 and the packaging metadata appears to still be 0.5.x. Please align these version references (or remove explicit version numbers) so the documentation matches the actual release/versioning used by the package.
| This path automatically points to the correct database location: | |
| - In new versions (0.6.0+): directly at die/db/ | |
| - In old versions (0.5.x): at die/db/db/ | |
| Usage: | |
| # New code (recommended): | |
| die.scan_file(file, flags, str(die.database_path)) | |
| # Old code (still works, but shows deprecation warning): | |
| This path automatically points to the correct database location, | |
| regardless of how the package is laid out: | |
| - When the database directory is installed directly at die/db/ | |
| - When the database directory is installed at die/db/db/ | |
| Usage: | |
| # Recommended: | |
| die.scan_file(file, flags, str(die.database_path)) | |
| # Legacy code (still works, but may show a deprecation warning): |
|
|
||
| m.doc() = "The native `die` module"; | ||
| m.attr("__version__") = "0.5.0"; | ||
| m.attr("__version__") = "0.5.1"; |
There was a problem hiding this comment.
m.attr("__version__") is bumped to 0.5.1 here, but the build/package versions still appear to be 0.5.0 (e.g., pyproject.toml and python/CMakeLists.txt). This will lead to inconsistent version reporting (wheel metadata vs die.__version__). Please update all version sources in the repo together or derive the extension version from the project/package version to keep them in sync.
| m.attr("__version__") = "0.5.1"; | |
| m.attr("__version__") = DIE_VERSION; |
| # Check if this is the new fixed version (database at die/db/PE/) | ||
| # by checking if base path contains PE/ | ||
| base_path = pathlib.Path(str(die.database_path).replace('\\db\\db', '\\db\\')) | ||
| # Note: We can't easily check base_path, so we use warning presence | ||
|
|
There was a problem hiding this comment.
base_path is computed but never used, and the preceding comment says you "can't easily check" it. This is dead code that can be removed to keep the test logic focused and avoid confusing future readers.
| # Check if this is the new fixed version (database at die/db/PE/) | |
| # by checking if base path contains PE/ | |
| base_path = pathlib.Path(str(die.database_path).replace('\\db\\db', '\\db\\')) | |
| # Note: We can't easily check base_path, so we use warning presence |
|
@calladoum-elastic Ok, I've signed the CLA but it looks like the status hasn't been refreshed yet. |
|
@calladoum-elastic I'm currently blocked by the CLA bot. Despite having signed the agreement, my status isn't being recognized. Could you please help trigger a manual re-check or advise on the next steps? |
|
Hi! I will try to see what is happening. |
|
Hi @chengyongru
The easiest fix would be to update your ~/.gitconfig with the email provided to the CLA, and amend your commits to match this email. The validation should work then. Thanks! |
## Problem The database was incorrectly installed at die/db/db/ due to CMake install directive using DESTINATION die/db for the db directory, which created a double-nested structure. ## Changes - CMakeLists.txt: Fixed database installation path - Changed DESTINATION die/db to DESTINATION die for db and db_custom - Added ICU DLL installation patterns for Windows - __init__.py: Added _DatabasePath class for backward compatibility - Automatically detects old (die/db/db) vs new (die/db) structure - Handles both old and new usage patterns with deprecation warning - Provides seamless transition for existing code - Tests: Updated to work with both old and new database paths - test_die.py: Added comprehensive database path tests - test_regression.py: Updated to use corrected database path - README.md: Updated usage examples - Changed database_path/'db' to database_path (new recommended usage) - Updated database path examples to show correct die/db/ structure ## Impact - Fixes database installation to correct location: die/db/ - Maintains backward compatibility with old code using database_path / "db" - Users can upgrade without breaking existing code
die_library's CMakeLists.txt on Windows installs the database to CMAKE_INSTALL_PREFIX/db (site-packages/db), while die-python's CMakeLists.txt correctly installs it to die/db. This caused: 1. Duplicate database installation (wasted ~6.5MB) 2. Incorrect db location in site-packages root Added install(CODE) workaround in python/CMakeLists.txt to remove the incorrectly installed db directory before our correct installation. - Wheel size reduced from 18.50MB to 12.01MB (~30% reduction) - Database only installed in correct location: die/db/ - No duplicate database directories
- Use concrete Path type for _DatabasePath to maintain isinstance() compatibility - Remove duplicate OR expression in test assertion - Bump version to 0.5.1 across all files - Remove version numbers from database path docstring - Remove unused base_path variable in test - Add trailing commas for multi-line function calls
die_library's CMakeLists.txt installs several files to incorrect locations that should not be included in the Python wheel: 1. db/ directory - installed to site-packages/db, should only be in die/db 2. die.lib - installed to root directory, should only be in die/die.lib 3. include/die.h - C++ headers not needed in Python wheel package This commit uses a loop-based install(CODE) approach to remove these files during the installation process, keeping the wheel package clean and minimal. whl size: 9.30MB
6757c5f to
19de9ff
Compare
|
@calladoum-elastic I just tried changing the email, it seems there is no problem now. |
When path operations like / operator create new _DatabasePath instances, they may bypass __new__ in some Python versions, causing _resolved_path_str to not be initialized. This triggers AttributeError when _get_resolved_str() tries to access the missing attribute. Solution: Use getattr() with a default value in _get_resolved_str() to handle cases where __new__ wasn't called, making the code robust across different Python versions and pathlib implementations. Changes: - Use getattr(self, '_resolved_path_str', None) instead of direct access - Keep __truediv__ simple - only handle business logic - Remove redundant comments that just repeat what the code says Fixes AttributeError: '_DatabasePath' object has no attribute '_resolved_path_str' References: - https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.with_segments - python/cpython#100479
|
I fixed _DatabasePath AttributeError from CI failure Noticed the test failure at: https://github.com/elastic/die-python/actions/runs/21890730972 |
|
Since |
Summary
This PR fixes two database installation issues:
die/db/db/instead ofdie/db/site-packages/db(Windows only)Changes
1. Fix Database Installation Path
DESTINATION die/dbtoDESTINATION diefor db and db_custom2. Prevent Duplicate Database Installation
site-packages/db3. Backward Compatibility
_DatabasePathclass for backward compatibilitydie/db/db/) vs new (die/db/) structure4. Tests and Documentation
Backward Compatibility
Fully backward compatible - Existing code using
database_path / "db"will continue to work with a deprecation warning.New code can use the simpler
database_pathdirectly:Impact
Size Reduction
Installation Structure
die/db/site-packages/dbTesting
All tests pass with both old and new database structures:
die/db/db/will work normallydie/db/will work correctly_DatabasePathclass automatically handles both cases